Summary : Properties of Enhancer-Gene Linking Methods


# This report shows the summary statistics for predictions.
This provides an overview of the properties of the candidate enhancer regions, enhancer regions used by the prediction methods. This will give us a better understanding of why certain methods might perform better under different validation strategies. Some of these statistics include:
Candidate Enhancer Regions are defined differently across the different methods, but is generally the FULL list of regions considered before selecting regulatory enhancer regions. Enhancer regions are regions that are known to regulate a gene.

  1. Number / Width / log2(DNase) / log2(H3K27ac) of Candidate Enhancer regions
  2. Number / Width / log2(DNase) / log2(H3K27ac) of Enhancer regions


Fig1. Number / Width / log2(DNase) / log2(H3K27ac) of Candidate Enhancer Regions and Enhancer Regions. Summary of properties of candidate enhancer regions and enhancer regions across all biosamples. A, Number of candidate enhancer regions and enhancer regions (bp) across all biosamples. B, Number of candidate enhancer regions and enhancer regions across all biosamples. C, log2(DNase) counts of candidate enhancer regions and enhancer regions across all biosamples. D, log2(H3K27ac) counts of candidate enhancer regions and enhancer regions across all biosamples.

Cumulative Density Plots comparing Width, Accessibility and Acetylation of Candidate regions and enhancer regions

Fig2. Cumulative Density Plots of Width, Accessibility and Acetylation for Candidate regions and enhancer regions. Width of candidate enhancer regions and enhancer regions ad defined by prediction methods. Accessibility and Acetylation for candidate enhancer regions and enhancer regions defined by prediction methods – as measured by DNase-seq/ATAC-seq and Histone Acetylation. Accessibility and Acetylation are calculated by getting the log2(DNase counts) / log2(H3K27ac counts) across candidate enhancer regions and enhancer regions.

Properties of Enhancer-Gene Connections



Fig3. Different Properties of Enhancer-Gene Regions. a, For a given region of the genome, the number of genes regulates across all biosamples. (top left) b, Given an enhancer (in a particular biosample), the number of biosamples that have an overlapping enhancer called. (top middle) c, The total unique basepairs are contained in the enhancers for each gene across all biosamples. (top right) d, For a given region of the genome, the number of genes it regulates across all biosamples. (bottom left) e, For a given region of the genome, the number of genes it regulates across all biosamples. (bottom right)

Methods

Comparison of enhancer-gene properties in enhancer regions and other enhancer-gene predictions
## Calculating Metrics Definitions of candidate and enhancer regions Candidate enhancer regions are defined as the full list of regions used, regardless of accessibility and activity of these regions. Enhancer regions are defined as the list of regions present in the final enhancer-gene links, filtering for links that are considered significant.
For the respective methods:
Candidate Enhancer Regions are defined as :
* ABC : Top 150k peaks on a DNase-Seq or ATAC-Seq bam file. Regions reized to fixed number of bp centered on peak summit. Blacklisted regions are removed, regions are merged.
* ChromHMM2017 : ChromHMM enhancer and promoter states to define active enhancers and promoters
* EpiMap : 3.5M+ DNaseI hypersensitive sites from Meuleman et al., (2020), which we overlay with H3K27ac activity and ChromHMM enhancer and promoter states to define active enhancers and promoters
Enhancer Regions are defined as :
* ABC : Enhancer regions that are linked to a gene and registers an ABC Score > 0.015
* ChromHMM2017 : Enhancer regions that are linked to a gene and registers a Score > 2.5
* EpiMap : Enhancer regions that are linked to a gene and registers a Score > 2.5

Number of candidate and enhancer regions We grouped candidate enhancer regions and enhancer regions by (Prediction,CellType) and removed duplicated regions, and counted the number of candidate and enhancer regions.

Width of candidate and enhancer regions We grouped candidate enhancer regions and enhancer regions by (Prediction,CellType) and removed duplicated regions. Using bedtools merge, we merged overlapping candidate enhancer regions and enhancer regions before calculating the width (end - start).

Accessibility of candidate and enhancer regions We downloaded DNase files across all the biosamples used for the predictors and calculated accessibility by reporting the log2(DNase read coverage) of regions via using bedtools coverage -counts.

Acetylation of candidate and enhancer regions We downloaded H3K27ac files across all the biosamples used for the predictors and calculated accessibility by reporting the log2(H3K27ac read coverage) of regions via using bedtools coverage -counts.

Number of Genes with a predicted enhancer in ANY biosample Using bedtools merge -c 4,4,5,5 -o collapse,count_distinct,collapse,count_distinct, where column 4 co rresponds to TargetGene and column 5 corresponds to Biosamples, we consolidated the number of genes and the number of biosamples that each enhancer region has. We counted the total number of unique genes that have a corresponding enhancer present in ANY biosample.

Number of Biosamples with similar Enhancer-Gene Links Using bedtools merge -c 4,4,5,5 -o collapse,count_distinct,collapse,count_distinct, where column 4 co rresponds to TargetGene and column 5 corresponds to Biosamples, we consolidated the number of genes and the number of biosamples that each enhancer region has. We used the column that counted distinct biosamples that have an overlapping enhancer region to report this metric.

Total number of unique basepair in all enhancers (bp) linked to a gene The total number of unique basepairs in all enhancers (bp) linked to a gene was calculated by grabbing all enhancers linked to a gene, merging overlapping enhancer regions and calculating the sum(enhancer_end - enhancer_start).

Number of Enhancer-Gene Connections per Gene across all Biosamples We consolidated the enhancer-gene connections across all biosamples for each prediction. For every gene, we calculated the number of unique Enhancer-Gene connections that are present in this consolidated file.

For a given region, number of genes it regulates across all biosamples We consolidated the enhancer-gene connections across all biosamples for each prediction. Using bedtools merge -c 4,4,5,5 -o collapse,count_distinct,collapse,count_distinct, where column 4 co rresponds to TargetGene and column 5 corresponds to Biosamples, we consolidated the number of genes and the number of biosamples that each enhancer region has. For every gene, we calculated the number of genes a given region regulates across all biosamples.

Prediction Methods Used

Enhancer-gene correlation (Activity-By-Contact)
DNase accessibility and acetylation of H3K27 are commonly used to identify enhancer elements, and are predictive of the expression of nearby genes and enhancer activity in plasmid-based reporter assays. The geometric mean of DNase-seq and H3K27ac ChIP–seq signals was used to calculate Activity. Contact component of the ABC score for E–G pairs from Hi-C data (when available) or the Average Hi-C data across 10 CellTypes, which was proven to an accurate substitute when celltype specific Hi-C data is not available. To calculate the relative effect of each element to the expression of a gene, we normalized the Activity by Contact of one element for a given gene to the sum of the Activity by Contact of other nearby elements. We included all elements within 5 Mb of the gene’s promoter in this calculation. Reference: https://www.nature.com/articles/s41588-019-0538-0

Enhancer-gene correlation (ChromHMM2017)
Gene expression was previously correlated with five active chromatin marks (H3K27ac, H3K9ac, H3K4me1, H3K4me2 and DNase I hypersensitivity) across 56 biosamples, and these correlation links were then used to make predictions for the predicted enhancers (regions with the ‘7Enh’ ChromHMM state) in 127 biosamples from the Roadmap Epigenome Atlas. We downloaded these predictions from www.biolchem.ucla.edu/labs/ ernst/roadmaplinking and made predictions using the confidence score. Reference: https://www.nature.com/articles/nprot.2017.124


Enhancer-gene correlation (EpiMap)
Gene expression was previously correlated with five active chromatin marks (H3K27ac, H3K9ac, H3K4me1, H3K4me2 and H3K4me3) across 304 biosamples. A negative set of correlations for each enhancer was computed using random genes in a different chromosome. We predicted links for each biosample and ChromHMM enhancer state separately (states E7, E8, E9, E10, E11 and E15). Predictions were made by training an XGBoost classifier on the positive set of all valid links against their paired negative links, using precomputed correlations and distance to the transcription start site as features, and keeping all links with a probability above 5/7. We downloaded these predictions from https://personal.broadinstitute.org/cboix/epimap/links/. Reference: https://www.biorxiv.org/content/10.1101/810291v2